Code Translation
Translation is the process of turning source code into object code
- Source code is high level (statements and structure understood by humans)
- Object code uses a specific instruction set and registers (understood by the CPU)
Various special programs are involved in code translation
- Compiler - Translates source code and stores resulting object code in a file
- Interpreter - Translates source code and performs its actions directly
- Assembler - Translates low level mnemonics into object code stored in a file
- Linker - Combines object code files and library code into a final executable image
Older languages (eg. C, C++) use a compiler and linker in the traditional manner
Newer languages (eg. Java, Python) use both a compiler and interpreter together
Types of Code
![[Pasted image 20250519090611.png]]
Source code is portable (platform independent)
- Write code once as a text file
- Use a compiler that targets a specific platform
- Compiler outputs assembly code or object code that will only run on the platform for which it was compiled (platform dependent)
- The same source code can be used with another compiler on a different platform
- (eg. write a game in C++, compile it for PC, XBox, PlayStation, Switch, etc.)
Compilers vs Interpreters
Compilers
- Take a source program and translate it as a whole into machine code
- Translation and execution phases are separate
Interpreters - Take one line of code at a time, determine its meaning, then carry out that action
- Translation and execution phases are interlaced (interpreter is the executable)
- No stand-alone executable code is generated
Comparing Translation Methods
Each method has advantages and disadvantages
| Compilers | Interpreters |
|---|---|
| Translate entire file (with context) | Translate line by line (no look ahead) |
| Compilation process is fast | Interpretation is slow |
| Compiled code runs much faster | Interpreted code runs much slower |
| Run once (to make executable code) | Run every time source file is 'executed' |
| Check for syntax and semantic errors | Can only check for syntax errors |
| Programming tools can be complex | Programming environment is easier |
| Syntax errors hard to locate by line | Syntax errors easier to identify in code |
| Need careful organisation of source code | Useful for rapid prototyping |
| Debugging can be complicated | Debugging tools are easier to use |
| Generated object code is platform locked | Source code is portable across platforms |
New Languages
New programming languages are regularly created to solve niche problems
- Easier to design an interpreter and easier to code in an interpreted language
- So newer languages tend to be interpreted
There will always be a need for compilers and people who know how they work - Interpreter itself needs to run as an executable program
- A compiled language will be needed to write the code for the interpreter
It's possible (and often easier) to write an interpreter in another interpreted language - Even so, a compiler will be needed to make the interpreter for that language
- Something must be compiled into an executable so it can run on the CPU
Javac & Java
Java uses both a compiler (javac) and interpreter (java)
- Java development kit (JDK) must be installed on the programmer's system
- Source code is compiled into intermediate byte code
- Byte code is portable (platform independent)
- Java virtual machine (JVM) is an interpreter that processes the byte code
- Java runtime environment (JRE) must be installed on user's system
Programmer's System:
- Source Code
- javac compiles source code using the CPU into byte code (portable)
User's System: - Byte code interpreted with JVM using the CPU
Some other languages use a similar approach but it all happens on the user's system
- Python compiles source code into byte code and then immediately interprets it
- Happens without the user knowing the byte code exists (might see .pyc files)
Types of Compiler
Early compilers were single pass
- Moves through source code line by line and generates object code on the way
- Faster and uses less memory
- Cannot perform any optimisations
- Requires variables to be declared before use
Modern compilers are multi pass - Reads the entire source file into an internal data structure
- Analyses and augments the data structure in multiple stages
- Slower and requires more memory
- Can look at entire code to see where optimisations can be made
- Allows programmer to use (eg. on line 10) something defined later (eg. on line 30)
- Final pass reads the internal data structure and generates object code
Linking
Can be many source files each with a corresponding object file
Source files also reference library code which exists in multiple object files
The linker pulls all these together into the final executable program image
![[Pasted image 20250519090647.png]]
Note: Difference between static and dynamic linking
Components of Programming Languages
- Lexical component
- List of all legal keywords allowed in the language (eg. import, public, else, while)
- Meaning of each individual keyword (the action or effect it has)
- Syntactical component
- Defines the form and structure of legal expressions in the language
- For example,
double private x;is not legal, whereasprivate double x;is legal
- Semantic component
- Adds meaning to an expression made from keywords and syntax structures
- Defines what should be done when executing a particular fragment of code
- For example, the action to be taken when the code is an if-else fragment
- (test condition, execute if part when true, or else part when false)
Language Grammars
A precise description of the programming language is needed to the translator can:
- Analyse source code for correctness
- Apply optimisations where possible
- Generate correct and unambiguous object code
A grammar is a set of formal syntax rules - Complex rules are defined in terms of other rules
- Simple rules are defined in terms of elementary symbols or keywords
A source file is just a plain text file containing characters - Translator groups characters into tokens that represent keywords, brackets, etc.
- A token is the smallest possible group of characters that has a valid meaning
- Grammar rules define the correct order of tokens
Syntax Diagrams
Grammar rules can be drawn as a set of diagrams
![[Pasted image 20250519091205.png]]
Expressions are formed from one or more tokens
- Some expressions are non-terminal (defined by other rules)
- Other expressions are terminal (defined in their final form with simple characters)
Extended Backus-Naur Form
The translator uses a more precise grammar called extended Backus-Naur form (EBNF)
Defines production rules from non-terminal rules and terminal symbols
![[Pasted image 20250519091347.png]]
Terminal symbols are defined in quotes
The comma character concatenates things to build the rule from smaller parts
EBNF uses symbols to mean certain things (eg. curly brackets), so if those symbols are to be matched they must be in quotes (as terminal symbols)
Every aspect of the language must be defined in this way, so there will be hundreds of rules for any reasonably complex programming language (Java has around 120)
EBNF Notation (ISO Version)
There are several types of EBNF, however ISO EBNF is used here
| Syntax | Meaning |
|---|---|
| = | Defines a rule |
| | | OR |
| , | Concatenation |
| rule-name | Non-terminal rule reference |
| "symbol" | Terminal symbol |
| [item] | Optional item (zero or one occurence) |
| {item} | Repeated item (zero or more occurences) |
| (items) | Grouped items |
| {item}- | At least one repeated item (one or more occurences) |
| n*item | Repeated item (exactly n times) |
| ; | Rule terminator |
EBNF Examples
Grammar rules for numbers can be defined
digit = "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9";
unsigned = {digit}-;
integer = unsigned | "+", unsigned | "-", unsigned;
fraction = ".", unsigned;
decimal = unsigned | fraction | unsigned, fraction;
Valid variable names can be defined
lower = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" |
"n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z";
upper = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" |
"N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z";
letter = lower | upper;
varname = letter, { letter | digit };
Eg. a variable must start with a letter, followed by any quantity of letters and digits in any combination
EBNF Matching
Given a string of characters, you can work our whether there is a valid match
- Valid:
zbc99,a9fg28p - Invalid:
7gbh8,89de4
If you wanted to also allow underscores in variable names:
varname = letter, { "_" | letter | digit };
Or also allow just one underscore at the start of a variable name:
varname = [ "_" ], letter, { "_" | letter | digit };
A compiler will try to match the source code against the EBNF grammar of the language
- High level code uses whitespace, brackets and semicolons to help provide structure
- Syntax errors generated if compiler cannot match every part of the source code
- Error messages often don't point to the exact line where the problem happened